Article Title

Abstract

Given an unlabeled protein sequence S and a known superfamily F, we wish to determine whether S belongs to F. We refer to F as the target class and to the set of sequences not in F as the non-target class. In general, a superfamily is a group of proteins that share similarities in structure, function, or both. If S is found to belong to F, one can infer the structure and function of S. This is important in drug discovery, for example: if S is obtained from some disease X and is determined to belong to superfamily F, one may try a combination of existing drugs for F to treat X [1]. We developed a hybrid protein classification system based on neural networks, Hidden Markov Models, and fuzzy logic, and tested it on sample proteins found in Bacillus subtilis. This paper describes the classification system and the results of our experiments on Bacillus subtilis.

Introduction

Bacillus subtilis is a bacterium that is commonly found in the environment rather than in humans, yet it is well known to be very friendly to the human system. It can promote dramatic healing benefits in humans, even though it is not one of the native microbes that normally inhabit the human body. This bacterium is so strong that it practically cannibalizes all harmful microorganisms in the human body. Furthermore, the cell wall components of ingested Bacillus subtilis are able to activate nearly all types of human antibodies, which are highly effective against many of the harmful viruses, fungi, and bacterial pathogens that regularly attempt to invade and infect the human system [2]. The proteins of this organism could therefore potentially be used to infer drugs that act as triggers for human antibodies.

Prior Work in Protein Classification

Protein classifiers are used to predict homology among protein sequences.
Here, homology refers to protein sequences that share the same function. Protein classifiers fall into two broad categories: explicit classifiers and implicit classifiers.

Explicit classifiers analyze protein sequences primarily through multiple sequence alignment [3]. No transformation is applied to the protein sequences. Multiple alignment tools and Hidden Markov Models both fall into this category.

Multiple alignment tools attempt to identify common regions of similarity across a collection of protein sequences known to be of similar function. Such a collection is known as a protein family. A multiply-aligned collection is used as a reference against which an unknown sequence is aligned. The larger the region of similarity shared between the unknown sequence and the multiply-aligned collection, the higher the score returned. The drawback of multiple sequence alignment is that different scoring schemes generate different "optimum" alignments. Furthermore, to generate better alignments, artificial insertions are made in the protein sequences.

Another approach uses Hidden Markov Models (HMMs) [4]. An HMM is built to represent an aligned collection. The aligned collection on which the HMM is built is called a consensus. An unknown sequence is then fed to the model, which determines the probability that the unknown sequence is a member of the consensus. Although probability values are more meaningful than arbitrary score values, the alignment of the consensus on which the HMM is built must first be arrived at. The method by which the consensus was obtained plays a major part in determining the shape of the HMM. Therefore, arbitrary scoring still indirectly influences the outcome of the HMM.
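As a rough illustration of how a model can assign a probability to an unknown sequence, the sketch below runs the standard forward algorithm on a toy two-state HMM. It is not the profile HMM built from a consensus in this work: the state names, the two-letter alphabet, and every probability are hypothetical values chosen only to make the example runnable.

```python
# Toy HMM (all values hypothetical): a "motif" state that favors the letter A
# and a "background" state that emits A and C equally.
STATES = ("motif", "background")
START_P = {"motif": 0.5, "background": 0.5}
TRANS_P = {
    "motif": {"motif": 0.9, "background": 0.1},
    "background": {"motif": 0.2, "background": 0.8},
}
EMIT_P = {
    "motif": {"A": 0.8, "C": 0.2},
    "background": {"A": 0.5, "C": 0.5},
}


def forward_probability(sequence, states=STATES, start_p=START_P,
                        trans_p=TRANS_P, emit_p=EMIT_P):
    """Return P(sequence | model), summed over all hidden state paths."""
    if not sequence:
        raise ValueError("sequence must contain at least one symbol")
    # alpha[s]: probability of emitting the prefix so far and ending in state s
    alpha = {s: start_p[s] * emit_p[s].get(sequence[0], 0.0) for s in states}
    for symbol in sequence[1:]:
        alpha = {
            s: emit_p[s].get(symbol, 0.0)
            * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


if __name__ == "__main__":
    for seq in ("AAA", "ACA", "CCC"):
        print(seq, forward_probability(seq))
```

Under these toy parameters, a sequence rich in A receives a higher probability than one rich in C, which is the kind of signal a motif model is meant to provide.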
To overcome the problem of arbitrary scoring schemes, neural networks are used.

Neural Networks in Protein Classification

Neural networks are implicit classifiers [5]. A protein sequence has to be encoded into a list of real values before a neural network can process it. The encoding scheme should be designed to bring out implicit similarities in a given protein family. Furthermore, encoding allows us to do away with arbitrary scoring schemes.

An example of a toy encoding scheme is 2-gram encoding. The 2-gram encoding method counts the frequency of every possible ordered pair of letters. For example, given the letters A, B, and C, there are 9 (3 to the power of 2) possible pairs (AA, AB, AC, BA, BB, BC, CA, CB, CC). Each pair is assigned to a particular neural network input node. The frequency of a pair is calculated by dividing the number of occurrences of that pair by the total number of pairs found in a given sequence. An N-lettered sequence has a total of (N - 1) pairs. Twenty amino acids, each represented by a letter, form the building blocks of protein sequences. When we apply the 2-gram encoding method, our perceptron therefore has 400 (20 to the power of 2) input nodes.

The problem with an encoding scheme such as 2-gram encoding is that local similarities such as motifs may be lost. A motif is a pattern that recurs across protein sequences. To overcome this problem, we devised an experimental hybrid protein classification method.

Hybrid Protein Classifier

Our experimental hybrid classifier combines neural networks with HMMs and fuzzy logic. One neural network is used to abstract one protein family. The neural network for a given protein family is trained to respond with +1 when presented with sequences from its assigned family.
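The 2-gram encoding described above can be sketched as follows. The function name `two_gram_encode`, the ordering of the input nodes, and the handling of letters outside the alphabet are assumptions made for illustration, not details taken from our implementation.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def two_gram_encode(sequence, alphabet=AMINO_ACIDS):
    """Return the 2-gram frequency vector of a sequence.

    The vector has len(alphabet) ** 2 entries, one per ordered pair, in the
    order produced by itertools.product (AA, AB, AC, ...). Each entry is the
    count of that pair divided by the N - 1 pairs in an N-letter sequence.
    """
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    index = {pair: i for i, pair in enumerate(pairs)}
    counts = [0.0] * len(pairs)
    total_pairs = len(sequence) - 1
    if total_pairs < 1:
        return counts  # no pairs to count in an empty or 1-letter sequence
    for i in range(total_pairs):
        pair = sequence[i:i + 2]
        # Assumption: pairs containing letters outside the alphabet are
        # skipped, but still count toward the N - 1 denominator.
        if pair in index:
            counts[index[pair]] += 1.0
    return [c / total_pairs for c in counts]


if __name__ == "__main__":
    # Three-letter toy alphabet from the text: 9 input values.
    print(two_gram_encode("ABCAB", alphabet="ABC"))
    # Amino-acid alphabet: 400 input values.
    print(len(two_gram_encode("MKVLAAGIVG")))
```

For the toy sequence ABCAB there are four pairs (AB, BC, CA, AB), so the AB node receives 0.5 and the BC and CA nodes receive 0.25 each.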
At the start of training, the weights of the neural network are initialized to zero so that it gives an initial default response of zero. An HMM is used to capture a particular local similarity (i.e., a motif). The presence of several motifs in a given protein family requires the use of several HMMs. The probability outputs from these HMMs are fed into the neural network as a bias. Our system will eventually have an array of neural network and HMM hybrids, each hybrid representing a particular protein family.

Not every hybrid will be able to output +1. Training plateaus at different levels for different protein families. Fuzzy logic is then used to determine HI, MED, and LO levels (i.e., fuzzy classes) for each protein family. Different families have different fuzzy class boundaries.

Experimental Results

For our experiment, we used perceptrons to abstract 8 protein families: Ras, Ribitol, Ferritin, Kinase, Cytochrome b5, Cytochrome c, Cytokines, and Acid proteases.

After training, we derived these fuzzy class boundaries for each family's perceptron output (TABLE 1):

Protein family    LO          MED         HI
Acid proteases    0.00-0.64   0.65-0.94   0.95-1.00
Kinase            0.00-0.59   0.60-0.83   0.84-1.00
Cytochrome b5     0.00-0.49   0.50-0.84   0.85-1.00
Cytochrome c      0.00-0.59   0.60-0.72   0.73-1.00
Cytokines         0.00-0.59   0.60-0.89   0.90-1.00
Ferritin          0.00-0.59   0.60-0.74   0.75-1.00
Ras               0.00-0.69   0.70-0.83   0.84-1.00
Ribitol           0.00-0.29   0.30-0.59   0.60-1.00

TABLE 1: Fuzzy class boundaries.

We downloaded 7 Bacillus subtilis protein samples from Pfam.
Four of those samples are proteins of unknown function, and the other three are of known function:

  DUF1002 (unknown function)
  DUF1021 (unknown function)
  DUF1027 (unknown function)
  DUF1054 (unknown function)
  Kinase (known function, and abstracted)
  Asparaginase (known function, but not abstracted)
  Permease (known function, but not abstracted)

We fed each Bacillus subtilis sample into each perceptron and obtained the following outputs (TABLE 2):

Sample        Acid Protease  Kinase  Cyto b5  Cyto c  Cytokine  Ferritin  Ras   Ribitol
DUF1002       0.79           0.81    0.35     0.60    0.77      0.73      0.79  0.39
DUF1021       0.77           0.83    0.33     0.55    0.82      0.70      0.78  0.36
DUF1027       0.60           0.73    0.35     0.52    0.74      0.65      0.65  0.39
DUF1054       0.60           0.77    0.37     0.47    0.79      0.68      0.68  0.32
Kinase        0.80           0.86    0.38     0.57    0.76      0.66      0.78  0.45
Asparaginase  0.73           0.79    0.36     0.52    0.69      0.62      0.73  0.36
Permease      0.86           0.87    0.31     0.54    0.71      0.64      0.77  0.57

TABLE 2: Perceptron outputs.

We then applied the fuzzy classes in TABLE 1 to our perceptron outputs to derive the results in TABLE 3:

Sample        Acid Protease  Kinase  Cyto b5  Cyto c  Cytokine  Ferritin  Ras   Ribitol
DUF1002       MED            MED     LO       MED     MED       MED       MED   MED
DUF1021       MED            MED     LO       LO      MED       MED       MED   MED
DUF1027       LO             MED     LO       LO      MED       MED       LO    MED
DUF1054       LO             MED     LO       LO      MED       MED       LO    MED
Kinase        MED            HI      LO       LO      MED       MED       MED   MED
Asparaginase  MED            MED     LO       MED     MED       MED       MED   MED
Permease      MED            HI      LO       LO      MED       MED       MED   MED

TABLE 3: Fuzzy classes applied to perceptron outputs.

As shown in TABLE 3, the system successfully rejected all the proteins of unknown function (i.e., no unknown protein received a HI score). The system also successfully detected the kinase sample, whose family was abstracted by the system. It also successfully rejected the asparaginase sample: asparaginase is a known protein, but its family was not abstracted by the system, so rejecting it was correct.
However, the permease sample was incorrectly labeled as belonging to the kinase family. Out of 8 abstracted families, only one returned an error. The margin of error for this misclassification is small enough for manual correction.
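The step from TABLE 2 to TABLE 3 can be sketched as a lookup against the TABLE 1 boundaries. This is a minimal sketch that treats each boundary as a crisp threshold, which is an assumption, since no membership functions are described here; the function names and the data layout are illustrative. Only the Kinase and Permease rows of TABLE 2 are included.

```python
# Lower bounds of the MED and HI classes per family, copied from TABLE 1.
BOUNDARIES = {
    "Acid proteases": (0.65, 0.95),
    "Kinase": (0.60, 0.84),
    "Cytochrome b5": (0.50, 0.85),
    "Cytochrome c": (0.60, 0.73),
    "Cytokines": (0.60, 0.90),
    "Ferritin": (0.60, 0.75),
    "Ras": (0.70, 0.84),
    "Ribitol": (0.30, 0.60),
}

# Perceptron outputs for two samples, copied from TABLE 2.
OUTPUTS = {
    "Kinase": {
        "Acid proteases": 0.80, "Kinase": 0.86, "Cytochrome b5": 0.38,
        "Cytochrome c": 0.57, "Cytokines": 0.76, "Ferritin": 0.66,
        "Ras": 0.78, "Ribitol": 0.45,
    },
    "Permease": {
        "Acid proteases": 0.86, "Kinase": 0.87, "Cytochrome b5": 0.31,
        "Cytochrome c": 0.54, "Cytokines": 0.71, "Ferritin": 0.64,
        "Ras": 0.77, "Ribitol": 0.57,
    },
}


def fuzzy_class(family, output):
    """Map one perceptron output to LO, MED, or HI for the given family."""
    med_min, hi_min = BOUNDARIES[family]
    if output >= hi_min:
        return "HI"
    if output >= med_min:
        return "MED"
    return "LO"


def classify_sample(outputs):
    """Return {family: fuzzy class} for one sample's perceptron outputs."""
    return {family: fuzzy_class(family, value) for family, value in outputs.items()}


def accepted_families(outputs):
    """Families whose perceptron output falls in the HI class."""
    return [f for f, label in classify_sample(outputs).items() if label == "HI"]


if __name__ == "__main__":
    for sample, outputs in OUTPUTS.items():
        print(sample, accepted_families(outputs))
```

With these inputs, both samples reach HI only for the Kinase family, which reproduces the correct kinase detection and the permease misclassification reported in TABLE 3.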


Publication date: 2006